Exploratory data analysis of the Wine Quality Data Set (White wine)

Research question

For this project, we are interested in creating a model to predict subjective wine quality scores, as scored by wine reviewers, based on a set of physicochemical features of the wine.

Summary of the dataset

The data set used in this project is the wine quality data set created by Dr. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. It was sourced from the UCI Machine Learning Repository, which can be found here. We will be working with the white wine data set for our analysis.

Each row in the data set represents the physicochemical properties of one wine (such as fixed acidity, residual sugar, density, pH, etc.), as well as a quality rating (based on sensory data) given to that wine.

There are 4898 observations in the data set and 11 features, plus the quality rating that serves as the target. No observation has a missing value, as can be seen in the cell below.
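The counts above can be reproduced with a few pandas calls. The following is a minimal sketch, not the notebook's original cell: the helper name `summarize` and the tiny frame `toy` are hypothetical, and the toy values are made up for illustration (they are not rows from the wine data).

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "quality") -> dict:
    """Count observations, features (columns other than the target),
    and rows that contain at least one missing value."""
    n_features = df.shape[1] - (1 if target in df.columns else 0)
    return {
        "n_observations": len(df),
        "n_features": n_features,
        "n_rows_with_missing": int(df.isna().any(axis=1).sum()),
    }


# Made-up frame for illustration only; one value is deliberately missing.
toy = pd.DataFrame({
    "alcohol": [9.5, 10.1, None, 12.0],
    "pH": [3.0, 3.2, 3.1, 3.3],
    "quality": [5, 6, 6, 7],
})
summary = summarize(toy)
print(summary)
```

Applied to the white wine data, the same helper would be expected to report 4898 observations, 11 features and 0 rows with missing values, matching the figures stated above.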

Partitioning the data into training and test sets

First, we partition the data set such that 75% of the data is in the train set and 25% of the data is in the test set.
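A 75/25 partition like this can be sketched with NumPy and pandas alone. This is an illustration under assumptions, not necessarily how the notebook performed the split: the function name `split_train_test`, the seed value, and the choice to round the test set size up are all assumptions. With that rounding, 4898 rows give 1225 test rows and 3673 training rows, which agrees with the training count reported in the appendix.

```python
import math

import numpy as np
import pandas as pd


def split_train_test(df: pd.DataFrame, test_fraction: float = 0.25, seed: int = 123):
    """Shuffle row positions and hold out a fraction as the test set.

    The test set size is rounded up (an assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))
    n_test = math.ceil(test_fraction * len(df))
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return df.iloc[train_idx], df.iloc[test_idx]


# Placeholder frame with the same number of rows as the white wine data;
# it carries only a row id, not real measurements.
placeholder = pd.DataFrame({"row_id": range(4898)})
train_df, test_df = split_train_test(placeholder)
print(len(train_df), len(test_df))  # prints: 3673 1225
```

Fixing the seed keeps the partition reproducible between runs, which matters when later steps are compared against the same test set.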

Below is a visualization of the number of observations for quality in each of the train and test splits.

As we would have hoped, the distribution of quality ratings is essentially the same in the train and test splits. Most of the weight of the distribution, however, lies around a rating of 6, so the quality variable is imbalanced. If we notice during initial training that this is affecting our results, we may need to look into methods to address the low number of observations at the ends of the quality rating scale.
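Besides the visual check, the class proportions of the two splits can be compared numerically. The sketch below is one way to do that; the helper name `quality_proportions` and the two small frames are hypothetical, with made-up ratings chosen so that one rating is absent from the test split.

```python
import pandas as pd


def quality_proportions(train: pd.DataFrame, test: pd.DataFrame,
                        target: str = "quality") -> pd.DataFrame:
    """Share of each quality rating in the train and test splits,
    with the absolute difference between the two shares."""
    table = pd.DataFrame({
        "train": train[target].value_counts(normalize=True),
        "test": test[target].value_counts(normalize=True),
    }).fillna(0.0).sort_index()  # a rating missing from one split gets share 0
    table["abs_diff"] = (table["train"] - table["test"]).abs()
    return table


# Made-up ratings for illustration only.
toy_train = pd.DataFrame({"quality": [5, 5, 6, 6, 6, 7, 8, 8]})
toy_test = pd.DataFrame({"quality": [5, 6, 6, 7]})
table = quality_proportions(toy_train, toy_test)
print(table)
```

Small values in `abs_diff` indicate that the split preserved the rating distribution. Ratings with a share near zero in either column are the sparse ends of the scale mentioned above.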

Check correlation between features

From this correlation table, we can observe a strong positive correlation between density and residual sugar, at about 0.84, and a strong negative correlation between density and alcohol, at about -0.78.
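A correlation table of this kind can be scanned programmatically for its strongest pairs. The sketch below assumes Pearson correlation (the pandas default) and uses a hypothetical helper `strongest_pairs` on made-up columns; it does not reproduce the 0.84 and -0.78 values, which come from the real data.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, top: int = 3) -> pd.Series:
    """Return the feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation of 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)


# Made-up columns for illustration only.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "feature_c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
pairs = strongest_pairs(toy)
print(pairs)
```

Sorting by absolute value keeps strong negative relationships, such as the one between density and alcohol, near the top of the list alongside strong positive ones.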

This chart shows that the feature most strongly correlated with wine quality ratings is "alcohol". Other features, including "density", "chlorides" and "volatile acidity", have a weak negative correlation with quality ratings.
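The ranking shown in the chart can be computed directly with `DataFrame.corrwith`. This is a sketch under assumptions: the helper name `correlation_with_target` and the toy columns are hypothetical, and the values are made up so that one column is strongly related to the target, one moderately negative, and one unrelated.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of every feature with the target,
    ordered from strongest to weakest in absolute value."""
    corr = df.drop(columns=target).corrwith(df[target])
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Made-up values for illustration only.
toy = pd.DataFrame({
    "unrelated": [1.0, 0.0, 1.0, 0.0, 1.0],
    "moderate_negative": [5.0, 3.0, 4.0, 1.0, 2.0],
    "strong_positive": [1.0, 2.0, 3.0, 4.0, 5.0],
    "quality": [1, 2, 3, 4, 5],
})
ranking = correlation_with_target(toy)
print(ranking)
```

The sign of each entry is preserved, so a feature with a weak negative correlation appears with a small negative value rather than being merged with the positive ones.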

Distribution of features

The above bar charts show the distribution of each feature in the data set. We can see that all of the features are continuous and that each follows a typical distribution with few outliers present. The data line up with outside sources; the pH, for example, falls within the expected range for most wines of 2.5-4.5. These charts show that some features are relatively uniform across most wines (e.g. chlorides, density, residual sugar) while other features vary more between wines (e.g. alcohol, sulphates, total sulfur dioxide).
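The visual impressions about spread and outliers can be backed with simple per-feature statistics. The sketch below uses the common 1.5 x IQR rule to count outliers, which is an assumption of this example rather than a rule stated in the analysis; the helper name `distribution_summary` and the toy columns are hypothetical.

```python
import pandas as pd


def distribution_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-feature range, skewness, and count of values outside
    the 1.5 * IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the columns.
    outside = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
    return pd.DataFrame({
        "min": df.min(),
        "max": df.max(),
        "skew": df.skew(),
        "n_iqr_outliers": outside.sum(),
    })


# Made-up columns for illustration only: one with a single extreme value,
# one evenly spread.
toy = pd.DataFrame({
    "with_outlier": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0],
    "evenly_spread": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})
stats = distribution_summary(toy)
print(stats)
```

The `min` and `max` columns also give a quick way to confirm a range check such as the expected pH range of 2.5-4.5.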

Overlap of features

With the above charts we can see how well each feature separates the quality ratings. For example, alcohol appears to be a promising feature: the distributions for lower ratings (such as 5 and 6) sit further to the left in the chart, while the distributions for higher ratings sit further to the right. Features such as fixed acidity might not be as useful since there is a lot of overlap between the quality ratings.
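The degree of overlap can also be checked numerically by summarizing a feature within each quality rating. This is a minimal sketch: the helper name `per_quality_summary` is hypothetical and the alcohol values below are made up for illustration, not taken from the data set.

```python
import pandas as pd


def per_quality_summary(df: pd.DataFrame, feature: str,
                        target: str = "quality") -> pd.DataFrame:
    """Count, median and range of one feature within each quality rating."""
    return df.groupby(target)[feature].agg(["count", "median", "min", "max"])


# Made-up values for illustration only.
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7, 7],
    "alcohol": [9.0, 9.4, 10.0, 10.6, 11.8, 12.2],
})
by_quality = per_quality_summary(toy, "alcohol")
print(by_quality)
```

A feature whose medians shift steadily across ratings while the min-max ranges barely overlap separates the ratings well. Heavily overlapping ranges with similar medians suggest a weaker feature, as described for fixed acidity.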

Appendix - Pandas Profiling Report

From the report above, we observe that there are no missing values. There are 12 variables (the 11 features and the quality rating), and all of them are numerical. The train portion of the white wine data set contains 3673 observations in total.